class: center, middle, inverse, title-slide # Covid-19, Global Pandemic, and Data Science ### Team Chrissy & Ricky
Chrissy Aman and Ricky Sun ### Bates College ### 2022-04-12 --- ## Outline <style type="text/css"> .remark-slide-content { font-size: 30px; padding: 1em 4em 1em 4em; } </style> - Introduction - Literature Review - Our Data - Methods & Data Analyses - Results - Limitations and Potential Future Studies --- class: inverse, center, middle background-image: url("images/cool.png") # Introduction --- # Introduction COVID-19, also known as Coronavirus disease 2019 is a contagious disease caused by a virus, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Up until yesterday, there are over 497,057,239 infections and 6,179,104 deaths since the beginning of the pandemic Although no one can predict a pandemic like this, with the help of data, we might be able to use available data to, for example, evaluate risks, so that the virus and mutations can be better managed or even contained in earlier stages --- ## Research Question - In our research, we are trying to use Covid-19 related data, together with other relevant data to find potential predictors for Covid-19 cases, deaths, or vaccination rates. ? Do vaccinations effectively mitigate the death rate ? Can we implement machine learning algorithm to predict Covid-19 ? Are higher percentage of older people predicts higher death rate ? What about other variables in predicting Covid-19 --- class: inverse, middle, center # Literature Review --- # [1a] Covid-19 severity .pull-left[ <div class="figure" style="text-align: center"> <img src="https://www.mdpi.com/pathogens/pathogens-09-00817/article_deploy/html/images/pathogens-09-00817-g001-550.jpg" alt="Reference: Mulinari T.O., et al. (2020). MDPI" width="80%" /> <p class="caption">Reference: Mulinari T.O., et al. (2020). MDPI</p> </div> ] .pull-right[ - The unpredictability of the progression of coronavirus disease 2019 (COVID-19) may be attributed to the low precision of the tools used to predict the prognosis of this disease, especially when the virus is mutating in a fast speed from alpha, to Omicron, and there are more recent variants too. ] --- # [1b] vitamin D and covid-19 .pull-left[ <div class="figure" style="text-align: center"> <img src="https://www.healio.com/~/media/slack-news/fm_im/misc/infographics/2020/september/pc0920meltzer_graphic_01.jpg?h=630&w=1200&la=en&hash=AA5CE3037B8695511804123FBF351C92" alt="Reference: Meltzer DO, et al. 2021. JAMA Netw Open" width="100%" /> <p class="caption">Reference: Meltzer DO, et al. 2021. JAMA Netw Open</p> </div> ] .pull-right[ - Several studies suggest an association between serum 25-hydroxyvitamin D (25OHD) and the likelihood of suffering severe symptoms of covid-19. ] --- # [2] Covid-19 and weather .pull-left[ - Akin to respiratory tract infection diseases, climatic conditions may significantly influence the COVID-19 pandemic, significant efforts have been made to explore the relationship between climatic condition and growth in number of COVID-19 cases. ] .pull-right[ <div class="figure" style="text-align: center"> <img src="https://ars.els-cdn.com/content/image/1-s2.0-S004896972033179X-ga1.jpg" alt="Reference: Mesay Moges Menebo. (2020)" width="80%" /> <p class="caption">Reference: Mesay Moges Menebo. (2020)</p> </div> ] --- # [3] Covid-19 and social media - Social media data (such as twits or social media indexes) from, for example, google search, twitter, facebook and other social media platform, may also be used to develop models and as early warning signals of COVID-19 outbreaks. Social media data can also presents with people's perception of risks and general mental states of a region. --- # [4] COVID-19 and impacts - Covid-19 also has had great impacts in our daily lives (racial issues, job markets, also economic activities, and so on) ### It is found that an increase in vaccination per capita is associated with a significant increase in economic activity. ### It is also found evidence for nonlinear effects of vaccines: marginal economic benefits when vaccination rates are higher. ### Country-specific conditions play an important role, with lower economic gains if strict containment measures are in place or if the country is experiencing a severe outbreak. --- # [5] Covid-19 and machine learning - Developing accurate forecasting tools will help in our fight against the pandemic. Prediction models that combine several features to estimate the risk of infection have been developed. - These aim to assist medical staff worldwide in triaging patients, especially in the context of limited healthcare resources. --- class: inverse, middle, center # Our Data --- ## Our Data - details Our dataset is coming from "Our World in Data" Covid-19 public data, together with data from JHU, WHO, CDC and World Bank. The data covers a wide range: - Basic Covid-19 data (cases, deaths) - Hospital & ICU (ICU beds, ICU patients) - Policy responses (stringency_index) - Reproduction rate - Tests & positivity - Vaccinations - Others (populations, life_expectancy, GDP per catpita and so on) --- class: inverse, middle, center # Methods & Data Analyses --- # Methods & Data Analyses - details We have three major parts of analyses: 1. preliminary exploration like summary statistics, scatter plots, correlations, maps 2. Regression analyses, ranging from OLS, Diff in Diff, regression continuity 3. Advanced models and machine learning algorithnms --- class: inverse, middle, center # Results & Implications --- # [1a] summary statistics --- # [1a] summary statistics Development (HDI) .pull-left[ - Some text - goes here ] .pull-right[ ``` ## Selecting by human_development_index ``` ``` ## # A tibble: 10 × 2 ## location human_development_index ## <chr> <dbl> ## 1 Norway 0.957 ## 2 Ireland 0.955 ## 3 Switzerland 0.955 ## 4 Hong Kong 0.949 ## 5 Iceland 0.949 ## 6 Germany 0.947 ## 7 Sweden 0.945 ## 8 Australia 0.944 ## 9 Netherlands 0.944 ## 10 Denmark 0.94 ``` ] --- ``` ## Reading layer `TM_WORLD_BORDERS-0.3' from data source ## `/cloud/project/data/world_shape_file/TM_WORLD_BORDERS-0.3.shp' using driver `ESRI Shapefile' ## Simple feature collection with 246 features and 11 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: -180 ymin: -90 xmax: 180 ymax: 83.6236 ## Geodetic CRS: WGS 84 ```
--- # Results & Implications [1b: preliminary analyses - correlation and heat map] <img src="presentation_files/figure-html/correlation_heatmap-1.png" width="80%" /> --- # Results & Implications [1c: preliminary analyses - maps] Include dynamic maps for vaccination Alt (alternative) text 1. chart type 2. of type data (x and y, color) 3. reason for including chart --- # Results & Implications [1d: preliminary analyses - scatter plots and other ggplots] ```r covid_data %>% filter(date == "2022-02-20") %>% filter(is.na(continent) == FALSE) %>% mutate(smoker = female_smokers + male_smokers) %>% mutate(vaccination_rate = people_fully_vaccinated/population) %>% mutate(booster = total_boosters/population) %>% ggplot(mapping = aes(x = vaccination_rate, y = human_development_index)) + geom_point(size = 2, mapping = aes()) + labs(title = "vaccination rate vs. human development index", subtitle = "xx", x = "vaccination rate", y = "human development index") + scale_color_viridis_d() + geom_smooth(color = "blue") ``` ``` ## `geom_smooth()` using method = 'loess' and formula 'y ~ x' ``` ``` ## Warning: Removed 131 rows containing non-finite values (stat_smooth). ``` ``` ## Warning: Removed 131 rows containing missing values (geom_point). ``` <img src="presentation_files/figure-html/vaccination-HDI-1.png" title="scatterplot of flipper length by bill length of 3 penguin species, where we show penguins with bigger flippers have bigger bills" alt="scatterplot of flipper length by bill length of 3 penguin species, where we show penguins with bigger flippers have bigger bills" width="80%" /> --- # Results & Implications [1d: preliminary analyses - scatter plots and other ggplots] ```r covid_data %>% filter(date == "2022-02-20") %>% filter(is.na(continent) == FALSE) %>% mutate(smoker = female_smokers + male_smokers) %>% mutate(vaccination_rate = people_fully_vaccinated/population) %>% mutate(death = total_deaths/population) %>% mutate(booster = total_boosters/population) %>% ggplot(mapping = aes(x = vaccination_rate, y = death)) + geom_point(size = 1, mapping = aes()) + labs(title = "percentage death of population vs. diabetes prevalence", subtitle = "xx", x = "percentage death of population", y = "diabetes prevalence") + scale_color_viridis_d() + geom_smooth(color = "blue") ``` ``` ## `geom_smooth()` using method = 'loess' and formula 'y ~ x' ``` ``` ## Warning: Removed 128 rows containing non-finite values (stat_smooth). ``` ``` ## Warning: Removed 128 rows containing missing values (geom_point). ``` <img src="presentation_files/figure-html/cases-diabetes-1.png" width="80%" /> --- # Results & Implications [1e: time series analyses] --- ``` ## Warning in location == c("China", "Japan", "India", "Brazil", "France", : longer object length ## is not a multiple of shorter object length ```
--- # Results & Implications [1e: time series analyses] ``` ## Warning: Ignoring unknown parameters: fill ``` <img src="presentation_files/figure-html/ARiMA-1.png" width="80%" /> ``` ## Warning: Ignoring unknown parameters: fill ``` ``` ## Warning: Removed 7 row(s) containing missing values (geom_path). ``` <img src="presentation_files/figure-html/ARiMA-2.png" width="80%" /> ```r cases_diff <- diff(covid_time$new_cases, differences=1) acf(cases_diff, lag.max=20) ``` <img src="presentation_files/figure-html/unnamed-chunk-2-1.png" width="80%" /> ```r acf(cases_diff, lag.max=20, plot=FALSE) ``` ``` ## ## Autocorrelations of series 'cases_diff', by lag ## ## 0 1 2 3 4 5 6 7 8 9 10 11 12 ## 1.000 -0.326 -0.245 0.095 0.090 -0.242 -0.149 0.608 -0.081 -0.230 0.004 0.121 -0.178 ## 13 14 15 16 17 18 19 20 ## -0.231 0.612 -0.130 -0.168 -0.003 0.113 -0.211 -0.133 ``` ```r pacf(cases_diff, lag.max=20) # plot a partial correlogram ``` <img src="presentation_files/figure-html/unnamed-chunk-3-1.png" width="80%" /> ```r pacf(cases_diff, lag.max=20, plot=FALSE) # get the partial autocorrelation values ``` ``` ## ## Partial autocorrelations of series 'cases_diff', by lag ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 ## -0.326 -0.393 -0.192 -0.056 -0.300 -0.528 0.247 0.347 0.332 0.150 0.129 0.159 -0.234 ## 14 15 16 17 18 19 20 ## 0.132 -0.124 -0.082 -0.088 -0.027 -0.123 -0.091 ``` ```r auto.arima(cases_diff) ``` ``` ## Series: cases_diff ## ARIMA(2,0,2) with zero mean ## ## Coefficients: ## ar1 ar2 ma1 ma2 ## 0.9022 -0.4360 -1.6963 0.8972 ## s.e. 0.0377 0.0421 0.0288 0.0184 ## ## sigma^2 = 4.093e+09: log likelihood = -9525.58 ## AIC=19061.16 AICc=19061.24 BIC=19084.35 ``` ```r covid_time_arima <- arima(covid_time$new_cases, order=c(2,0,2)) covid_time_arima ``` ``` ## ## Call: ## arima(x = covid_time$new_cases, order = c(2, 0, 2)) ## ## Coefficients: ## ar1 ar2 ma1 ma2 intercept ## 0.3710 0.6069 0.0512 -0.5558 103166.49 ## s.e. 0.0803 0.0790 0.0687 0.0449 52848.56 ## ## sigma^2 estimated as 5.029e+09: log likelihood = -9618.34, aic = 19248.68 ``` ```r fit <- Arima(cases_diff, order=c(2,0,2)) checkresiduals(fit) ``` <img src="presentation_files/figure-html/unnamed-chunk-6-1.png" width="80%" /> ``` ## ## Ljung-Box test ## ## data: Residuals from ARIMA(2,0,2) with non-zero mean ## Q* = 258.4, df = 5, p-value < 2.2e-16 ## ## Model df: 5. Total lags used: 10 ``` ```r autoplot(forecast(fit)) ``` <img src="presentation_files/figure-html/unnamed-chunk-6-2.png" width="80%" /> --- # Results & Implications - [2a: regression analyses - simple linear regression] ``` ## # A tibble: 5 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -550. 825. -0.667 0.522 ## 2 vaccination -559. 1558. -0.359 0.728 ## 3 stringency_index 9.07 15.3 0.593 0.567 ## 4 handwashing_facilities 10.8 12.7 0.850 0.418 ## 5 gdp_per_capita 0.0748 0.0367 2.04 0.0719 ``` ``` ## [1] 0.5955702 ``` ``` ## [1] 0.4158236 ``` ``` ## # A tibble: 5 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 441. 551. 0.799 0.469 ## 2 vaccination -3193. 3904. -0.818 0.459 ## 3 stringency_index -10.9 8.59 -1.27 0.272 ## 4 handwashing_facilities -4.23 9.59 -0.441 0.682 ## 5 gdp_per_capita 0.118 0.0379 3.11 0.0359 ``` ``` ## [1] 0.757986 ``` ``` ## [1] 0.5159721 ``` --- # Results & Implications - [2a: regression analyses - simple linear regression] --- # Results & Implications - [2b: regression analyses - other] 1. Fixed effect ```r #covid_data2 <- covid_data1 %>% # select(vaccine_introduced, total_vaccinations, date, first_vaccination, interaction_term, #location, new_cases_per_million, weekly_hosp_admissions_per_million) ``` ```r #covid_data3 <- covid_data1 %>% # select(date, location, new_cases_per_million, weekly_hosp_admissions_per_million, people_fully_vaccinated_per_hundred, new_deaths_per_million) %>% na.omit(people_fully_vaccinated_per_hundred, population_density, total_cases_per_million) ``` ```r #covid_data7 <- covid_data3 %>% # filter(location == "United States" | location == "United Kingdom" | location == "Canada" | location == "Belgium" | location == "Israel") ``` ```r #fixed.dum <-lm(weekly_hosp_admissions_per_million ~ people_fully_vaccinated_per_hundred + factor(location) - 1, data = covid_data7) #summary(fixed.dum) #yhat <- fixed.dum$fitted #scatterplot(yhat ~ covid_data7$people_fully_vaccinated_per_hundred | covid_data7$location, xlab ="people fully vaccinated per hundred", ylab ="covid cases", boxplots = FALSE, smooth = FALSE) #abline(lm(covid_data7$weekly_hosp_admissions_per_million~covid_data7$people_fully_vaccinated_per_hundred),lwd=5, col="red") ``` --- # Results & Implications - [2b: regression analyses - other] 2. Regression Discontinuity ```r rdd_data(y = covid_country$new_deaths_per_million, x = covid_country$vaccination, cutpoint = 0.1) %>% rdd_reg_lm(slope = "separate") %>% summary() ``` ``` ## ## Call: ## lm(formula = y ~ ., data = dat_step1, weights = weights) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.394 -3.513 -1.533 1.405 19.208 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.171 2.684 1.927 0.0596 . ## D -1.279 3.969 -0.322 0.7485 ## x 16.107 35.252 0.457 0.6497 ## x_right -23.037 36.680 -0.628 0.5328 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 5.42 on 51 degrees of freedom ## (160 observations deleted due to missingness) ## Multiple R-squared: 0.01997, Adjusted R-squared: -0.03768 ## F-statistic: 0.3464 on 3 and 51 DF, p-value: 0.7919 ``` ```r covid_country %>% select(vaccination, new_deaths_per_million) %>% mutate(threshold = as.factor(ifelse(vaccination >= 0.1, 1, 0))) %>% ggplot(aes(x = vaccination, y = new_deaths_per_million)) + geom_point(aes(color = threshold)) + geom_smooth(method = "lm", se = FALSE) + scale_color_brewer(palette = "Accent") + guides(color = FALSE) + geom_vline(xintercept = 0.1, color = "red", size = 1, linetype = "dashed") + labs(y = "New deaths (per million)", x = "Vaccination rate") + theme_minimal() ``` ``` ## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` ## instead. ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` ``` ## Warning: Removed 160 rows containing non-finite values (stat_smooth). ``` ``` ## Warning: Removed 160 rows containing missing values (geom_point). ``` <img src="presentation_files/figure-html/unnamed-chunk-8-1.png" width="80%" /> ```r covid_country %>% select(vaccination, new_deaths_per_million) %>% mutate(threshold = as.factor(ifelse(vaccination >= 0.1, 1, 0))) %>% ggplot(aes(x = vaccination, y = new_deaths_per_million, color = threshold)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + scale_color_brewer(palette = "Accent") + guides(color = FALSE) + geom_vline(xintercept = 0.1, color = "red", size = 1, linetype = "dashed") + labs(y = "New deaths (per million)", x = "Vaccination rate") + theme_minimal() ``` ``` ## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` ## instead. ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` ``` ## Warning: Removed 160 rows containing non-finite values (stat_smooth). ## Warning: Removed 160 rows containing missing values (geom_point). ``` <img src="presentation_files/figure-html/unnamed-chunk-8-2.png" width="80%" /> ```r covid_total <- covid_data %>% filter(date == "2022-02-20") covid_country <- covid_total %>% filter(is.na(continent) == FALSE) %>% mutate(vaccination = people_fully_vaccinated/population) %>% mutate(booster = total_boosters/population) %>% mutate(death_norm = (total_deaths_per_million - mean(total_deaths_per_million)) / sd(total_deaths_per_million)) rdd_data(y = covid_country$new_deaths_per_million, x = covid_country$vaccination, cutpoint = 0.5) %>% rdd_reg_lm(slope = "same") %>% summary() ``` ``` ## ## Call: ## lm(formula = y ~ ., data = dat_step1, weights = weights) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.6545 -2.0264 -1.3366 0.9876 15.0801 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.5810 0.8455 3.052 0.00303 ** ## D -0.7295 1.3692 -0.533 0.59556 ## x 2.0274 2.6807 0.756 0.45155 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.23 on 85 degrees of freedom ## (128 observations deleted due to missingness) ## Multiple R-squared: 0.007168, Adjusted R-squared: -0.01619 ## F-statistic: 0.3069 on 2 and 85 DF, p-value: 0.7366 ``` ```r covid_country %>% select(vaccination, new_deaths_per_million) %>% mutate(threshold = as.factor(ifelse(vaccination >= 0.5, 1, 0))) %>% ggplot(aes(x = vaccination, y = new_deaths_per_million)) + geom_point(aes(color = threshold)) + geom_smooth(method = "lm", se = FALSE) + scale_color_brewer(palette = "Accent") + guides(color = FALSE) + geom_vline(xintercept = 0.5, color = "red", size = 1, linetype = "dashed") + labs(y = "New deaths (per million)", x = "Vaccination rate") + theme_minimal() ``` ``` ## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` ## instead. ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` ``` ## Warning: Removed 128 rows containing non-finite values (stat_smooth). ``` ``` ## Warning: Removed 128 rows containing missing values (geom_point). ``` <img src="presentation_files/figure-html/unnamed-chunk-10-1.png" width="80%" /> ```r covid_country %>% select(vaccination, new_deaths_per_million) %>% mutate(threshold = as.factor(ifelse(vaccination >= 0.5, 1, 0))) %>% ggplot(aes(x = vaccination, y = new_deaths_per_million, color = threshold)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + scale_color_brewer(palette = "Accent") + guides(color = FALSE) + geom_vline(xintercept = 0.5, color = "red", size = 1, linetype = "dashed") + labs(y = "New deaths (per million)", x = "Vaccination rate") + theme_minimal() ``` ``` ## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` ## instead. ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` ``` ## Warning: Removed 128 rows containing non-finite values (stat_smooth). ## Warning: Removed 128 rows containing missing values (geom_point). ``` <img src="presentation_files/figure-html/unnamed-chunk-10-2.png" width="80%" /> --- # Results & Implications - [3: machine learnng] 1. linear regression 2. logistic regression 3. KNN (clusters) https://github.com/allisonhorst/stats-illustrations/ --- class: inverse, middle, center # Limitations and Potential Future Studies --- # Limitations - details --- # Future Studies Future studies --- # References [1] Want to find out more about `xaringan`? See https://slides.yihui.name/xaringan/#1. [2] You are welcomed to use the default styling of the slides. In fact, that's what I expect majority of you will do. You will differentiate yourself with the content of your presentation. [3] [4] ---